-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Kirchhoff cuda integrated and tested successfully #617
base: dev
Are you sure you want to change the base?
Conversation
@Amnah2020 thanks for your contribution 😄 I have started looking at your code focusing on the case when
and get an operator that takes a numpy array as input, moves it to cupy, runs the entire
Of course, this can be further wrapped into a I have documented most of these points here https://github.com/PyLops/pylops_notebooks/blob/master/developement-cupy/Kirchhoff.ipynb Please reply to my questions before pushing any code, as I have made a significant amount of changes locally (the notebook will unlikely not work for you...), so based on your replies I can finalize them and push them before we continue finalizing the PR. PS: once we agree on those, I'll look in more details at the case when |
wrap = current_device.WARP_SIZE | ||
multipr_count = current_device.MULTIPROCESSOR_COUNT | ||
self.num_threads_per_blocks = (wrap, wrap) | ||
self.num_blocks = (multipr_count, multipr_count) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am unsure about this. For my GPU (RTX 3090), the output of this:
from numba import cuda
current_device = cuda.get_current_device()
current_device.WARP_SIZE, current_device.MULTIPROCESSOR_COUNT
gives (32,82). This means that indipendently on the number of sources and receivers, you always have a 82x82 grid with 32x32 threads, which in many cases will call many blocks that will never pass the if ns > isrc and nr > irec:
condition. Am I getting this wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, each block always has a fixed number of threads, specifically (32, 32)
. The number of blocks, however, depends on the total number of threads required with a maximum of 82*82 grid. This implementation ensures that no block is allocated unless it will run at least one thread.
However, I need to fix the last else as it does not follow the same approach ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand your reply. My point is that you code will ALWAYS allocate the same number of blocks no matter the size of the problem.
Take this code:
wrap = current_device.WARP_SIZE
multipr_count = current_device.MULTIPROCESSOR_COUNT
self.num_threads_per_blocks = (wrap, wrap)
self.num_blocks = (multipr_count, multipr_count)
on my GPU (RTX 3090), I get self.num_threads_per_blocks = (32, 32)
and self.num_blocks = (82, 82)
. Now later, you do:
self._trav_kirch_matvec_cuda[self.num_blocks, self.num_threads_per_blocks](x_d, y_d, nsnr_d, nt_d, ni_d, itrav_d, travd_d)
this will ALWAYS call 82x82 blocks no matter how many sources or receivers you have in your problem. So say for 64 sources and 64 receivers, you will have the first 2x2 blocks doing something and all the other running but just never passing this condition if ns > isrc and nr > irec:
.
In the new code, i changed this to:
self.num_blocks = ((self.ns + self.num_threads_per_blocks[0] - 1
) // self.num_threads_per_blocks[0],
(self.nr + self.num_threads_per_blocks[1] - 1
) // self.num_threads_per_blocks[1])
which is actually similar to what you had done for case 4. Here we don't launch any block that will not contribute to the final solution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for correcting the kernel launch, as I did not follow the same approach as in the other cases.
I missed this part and ended up copying a lot of code because I didn't fork the repo before starting to make changes.
@staticmethod | ||
@cuda.jit | ||
def _travsrcrec_kirch_rmatvec_cuda(x, y, ns, nr, nt, ni, dt, trav_srcs, trav_recs): | ||
irec, isrc = cuda.grid(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this is swapped compared to line 136?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for noticing! The tests have passed for a uniform number of sources and receivers.
However, I anticipate that if we vary these numbers, the kernel launch and setup may not adequately cover both. So, this line should remain consistent across all 2D grid launch calls and setups. I’ll make sure to enhance it accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, great! A good reccomandation is to never test codes with forced symmetries
as many bugs could pass unnoticed (i sometimes make this mistake myself 😆 )
itrav_d, travd_d) | ||
|
||
if not self.dynamic: | ||
cuda.synchronize() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed since for not self.dynamic
you use only the default stream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for the dynamic case, for each stream, I need to synchronize it with the main stream in a looped sequence. This synchronization happens within self._process_streams().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but you have if not self.dynamic:
so this will never be run in the dynamic case anyways?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct
@cako (aka our gpu guru): if you have time, it would be good if you can also have a quick look into this 😄 |
@Amnah2020 checking you have seen my review. After you have been able to answer my questions, I will push some minor code modifications and we will be able to progress to the second part of the PR. |
Maybe you could share the ToCupy function, and I can explore different strategies to leverage it. The goal would be to achieve a similar high-level call structure for full processing of Kirchhoff with multiple sources at once in the device. |
Thanks @Amnah2020, let me push the modifications to the code I already made (later today) so:
This will be already gone
I see your point and I am familiar with the
Let's discuss this after i push the updated cod.
Not sure what you mean?
No. You can do the same in Cupy or Numba. In either case if the array is on the CPU. In our experience, interoperating numba and cupy does not come with any overhead (and this is how PyLops work). We prefer using cupy for everything GPU related other than custom kernels (which is true, are easier to write in numba-cuda than cupy custom kernels), and I would like to make sure this new code follows the same philosophy to avoid creating too many different coding patterns without a real reason.
Ok, interesting. My understanding, from the link above, was that
Doing soon :) |
@Amnah2020 codes and notebook updated. I am also going to open a new Issue explaining in more details the philosophy behind |
Sorry, pretty busy at work, haven't been able to look at this yet :( |
No rush, we can wait for you ;) |
streams = [cuda.stream() for _ in range(num_streams)] | ||
sources_per_stream = num_sources // num_streams | ||
remainder = num_sources % num_streams | ||
source_counter = 0 | ||
self.sources_per_streams = {} | ||
for i, stream in enumerate(streams): | ||
num_sources_for_stream = sources_per_stream + ( | ||
1 if i < remainder else 0 | ||
) | ||
sources_for_stream = list( | ||
range(source_counter, source_counter + num_sources_for_stream) | ||
) | ||
self.sources_per_streams[stream] = sources_for_stream | ||
source_counter += num_sources_for_stream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can substitute all this by:
# Split sources into different processing streams
self.sources_per_streams = {
cuda.stream(): sources
for sources in np.array_split(range(self.ns), self.nstreams)
}
# Check streams are unique
assert (
len(set(s.handle.value for s in self.sources_per_streams.keys()))
== self.nstreams
)
aperturetap_d = cuda.to_device(aperturetap) | ||
vel_d = cuda.to_device(vel) | ||
six_d = cuda.to_device(six) | ||
rix_d = cuda.to_device(rix) | ||
self.const_inputs = ( | ||
ns_d, | ||
nr_d, | ||
nt_d, | ||
ni_d, | ||
nz_d, | ||
dt_d, | ||
aperturemin_d, | ||
aperturemax_d, | ||
angleaperturemin_d, | ||
angleaperturemax_d, | ||
vel_d, | ||
aperturetap_d, | ||
six_d, | ||
rix_d, | ||
) | ||
|
||
self.trav_recs_d_global = cuda.to_device(trav_recs) | ||
self.angles_recs_d_global = cuda.to_device(angle_recs) | ||
self.trav_srcs_d_global = cuda.to_device(trav_srcs) | ||
self.angles_srcs_d_global = cuda.to_device(angle_srcs) | ||
self.amp_srcs_d_global = cuda.to_device(amp_srcs) | ||
self.amp_recs_d_global = cuda.to_device(amp_recs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how much overhead these to_device
are adding, but if you'd like to make this more efficient, you can queue the memory transfers per stream and use pinned arrays to ensure async. Here's a snippet (untested):
streams = list(self.sources_per_streams.keys())
n = len(streams)
with cuda.pinned(
aperturetap,
vel,
six,
rix,
trav_recs,
angle_recs,
trav_srcs,
angle_srcs,
amp_srcs,
amp_recs,
):
aperturetap_d = cuda.to_device(aperturetap, stream=streams[0 % n])
vel_d = cuda.to_device(vel, stream=streams[1 % n])
six_d = cuda.to_device(six, stream=streams[2 % n])
rix_d = cuda.to_device(rix, stream=streams[3 % n])
self.trav_recs_d_global = cuda.to_device(trav_recs, stream=streams[4 % n])
self.angles_recs_d_global = cuda.to_device(
angle_recs, stream=streams[5 % n]
)
self.trav_srcs_d_global = cuda.to_device(trav_srcs, stream=streams[6 % n])
self.angles_srcs_d_global = cuda.to_device(
angle_srcs, stream=streams[7 % n]
)
self.amp_srcs_d_global = cuda.to_device(amp_srcs, stream=streams[8 % n])
self.amp_recs_d_global = cuda.to_device(amp_recs, stream=streams[9 % n])
cuda.synchronize()
y = np.zeros((self.ns * self.nr, self.nt)) | ||
elif opt == "_rmatvec": | ||
y = np.zeros(self.ni) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe these should be zeros(dtype=np.float32)
?
x_d = cuda.to_device(x, stream=stream) | ||
y_d_dict[stream] = cuda.to_device(y, stream=stream) | ||
isrc_list_d_dict[stream] = cuda.to_device(isrc_list, stream=stream) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should be pinned to allow async:
with cuda.pinned(x, y, isrc_list):
x_d = cuda.to_device(x, stream=stream)
y_d_dict[stream] = cuda.to_device(y, stream=stream)
isrc_list_d_dict[stream] = cuda.to_device(isrc_list, stream=stream)
x_d = cuda.to_device(x, stream=stream) | ||
y_d_dict[stream] = cuda.to_device(y, stream=stream) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are x
and y
being copied multiple times per stream? Not sure they should...
for stream in self.sources_per_streams.keys(): | ||
stream.synchronize() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure this can be substituted with cuda.synchronize, but that would also sync the default stream.
y = np.zeros((self.ns * self.nr, self.nt)) | ||
elif opt == "_rmatvec": | ||
y = np.zeros(self.ni) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You probably want to create pinned memory y
arrays to facilitate copy to host. Example:
y = cuda.pinned_array((self.ns * self.nr, self.nt), dtype=np.float32)
stream.synchronize() | ||
y_streams = [] | ||
for stream, y_dev in y_d_dict.items(): | ||
y_streams.append(y_dev.copy_to_host(stream=stream)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When copying to host, it's better to provide a target array which is pinned. Example:
for stream, y_dev in y_d_dict.items():
ary = cuda.pinned_array_like(y_dev)
y_dev.copy_to_host(ary, stream=stream)
y_streams.append(ary)
aptap = ( | ||
aptap | ||
* aperturetap[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aptap *= ...
Several lines have the same pattern
Hey guys, great work! I left some comments, mostly around memory transfer assuming that I guess the question is, should CuPy or raw Numba arrays be used? To make sure we're all on the same page, for Numba kernels, it makes no difference. Numba and CuPy arrays implement the CUDA Array Interface so you can just call a cuda-jitted kernel directly on them. Same thing goes for PyTorch (not Jax, I think because they only support the v2 of the interface, not the latest v3). The difference comes from how memory is managed. With CuPy, memory is managed automatically via Memory Pool. The user requests memory and the memory pool decides when and how to allocate it. Versus Numba which leaves memory allocation entirely on the hands of the user. For higher level libraries, managed memory allocators have historically been favored. CuPy has its own, PyTorch, and RAPIDS. They make it simpler to deal with memory, generally offer better performance and allow the user to set global memory limits (useful when sharing resources). Personally, I feel like PyLops operators should do as little magic as possible internally, and it should be up to the user to manage memory. Of course, this cannot be done to the extreme, as there may be intermediate steps where memory allocation is required. In the case of the Kirchhoff operator, the behavior that would make more sense to me is the following: when selecting cuda backend, all input array must be consumable by Numba kernels. They can be CuPy, PyTorch, Numba, etc, and possibly a combination of them. The user should be responsible for allocating them. The return arrays IMO should be CuPy, ensuring that they use the currently active memory pool which is already available to most workflows (since we use CuPy extensively). |
Thanks @cako! I was waiting for @Amnah2020 to reply, but I'll go ahead and provide my feedback on your comments:
This is my understanding as well from other projects. If you have a cupy array and pass it to a numba-cuda custom kernel call there will be no-copy 'data transfer' between the two interfaces thanks to the CUDA Array Interface, so the user should not do any explicit call to move a cupy array into a numba one. So when we write a kernel that assumes the input to be already provided as cupy array we should not do any data movement - this is already the case in
Makes sense to me, and indeed I would favour managed memory allocation if/when that does not bring any downside as it simplifies the user experience.
Agree.
I also agree with this in general, but with a caveat. Let's take the non-dynamic codes first. The way I have adapted them in the PR fully embrace this philosophy. When the problem is small / GPU VRAM is large enough, the user can lift x/y to the GPU and then pass them to the matvec/rmatvec calls. And this is pretty much in line with any other PyLops operator GPU implementation. In general, we also give user the responsability to pass any additional parameter already as CuPy array (eg wavelet). However, in this specific case, the
So line 559 Now to the dynamic case. The way the code is currently set up departs a bit from the way the non-dynamic code is set up as it uses streams... I think the main motivation here is that since you have more My dream (not sure how to achieve this yet) would however be to lift this one level up, perhaps inside the ToCupy operator, so that one could run multiple GPU operators in a VStack/BlockDiag on different streams instead of embedding that logic in each single operator... but I think we can be pragmatic and finalize this PR as is and then if we achieve the dream solution, we can adapt the dynamic code to use it.
PS: I removed all codes that I had added for the |
No description provided.